{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f4b62051",
   "metadata": {},
   "source": [
    "This cross-service notebook walks you through the process of using Amazon Textract's DetectDocumentText API to extract text from a JPG/JPEG/PNG file containing text, and then using Amazon Comprehend's DetectEntities API to find entities in the extracted text."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93c62ecb",
   "metadata": {},
   "source": [
    "You can run this notebook using AWS SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8496bef4",
   "metadata": {},
   "source": [
    "### To run this notebook using SageMaker\n",
    "\n",
    " 1. Navigate to Amazon SageMaker from the AWS Management Console. From the SageMaker dashboard, select Notebook and then choose \"Notebook instances\".\n",
    " 2. Select \"Create notebook instance\".\n",
    " 3. Select the optional \"Git repositories\" menu, give the notebook instance a name, and choose the option to \"Clone a public Git repository to this notebook instance only\". \n",
    " 4. In the \"Git repository URL\" box, paste in the URL for this repository (https://github.com/awsdocs/aws-doc-sdk-examples).\n",
    " 5. Select \"Create notebook instance\".\n",
    " 6. You will need to add some policies to your Amazon SageMaker role so it can access Amazon Simple Storage Service (Amazon S3), Amazon Textract, and Amazon Comprehend. You can add the following policies to your role: `AmazonTextractFullAccess`, `ComprehendFullAccess`, `AmazonS3ReadOnlyAccess`. \n",
    " 7. After the notebook instance has been created, select it from the list of notebooks. \n",
    " 8. Choose \"Open Jupyter\". After the Jupyter notebook has started, navigate to the directory containing this notebook and select this notebook to run it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96c58983",
   "metadata": {},
   "source": [
    "In order to make use of the AWS SDK for Python (Boto 3) in this notebook, you will need to configure your AWS credentials. The `GetPass` library allows you to input your AWS credentials and keep them unexposed. When you run the code cell below, copy the value of your AWS access key into the first text box and the value of your AWS secret key into the second text box."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "faf8e08d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
    "# SPDX-License-Identifier: Apache-2.0\n",
    "\n",
    "import getpass\n",
    "access_key = getpass.getpass()\n",
    "secret_key = getpass.getpass()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a99c2e4",
   "metadata": {},
   "source": [
    "After setting your security credentials, import any other libraries you need. Set the name of both the Amazon S3 bucket you have your image in and the name of the image itself. In the code below, replace the value of \"bucket\" with the name of your bucket, replace the value of \"document\" with the name of the image file you want to analyze, and replace the value of \"region_name\" with the name of the Region you are operating in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f869f525",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import io\n",
    "from PIL import Image               \n",
    "from IPython.display import display \n",
    "import json\n",
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "bucket = 'DOC-EXAMPLE-BUCKET'\n",
    "document = 'Name of your document'\n",
    "region_name = 'Name of your region'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8fb2e2",
   "metadata": {},
   "source": [
    "Create a function that connects to both Amazon S3 and Amazon Textract via Boto3. The function presented in the following code starts by connecting to the Amazon S3 resource and retrieving the image you specified from the bucket you specified. The function then connects to Amazon Textract and calls the DetectDocumentText API to extract the text in the image. The lines of text found in the document are stored in a list and returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec81295c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_text_detection(bucket, document, access_code, secret_code, region):\n",
    "    \n",
    "    # Get the document from Amazon S3\n",
    "    s3_connection = boto3.resource(\"s3\")\n",
    "\n",
    "    # Connect to Amazon Textract to detect text in the document\n",
    "    client = boto3.client(\"textract\", region_name=region, aws_access_key_id=access_code, \n",
    "    aws_secret_access_key=secret_code)\n",
    "\n",
    "    # Get the response from Amazon S3\n",
    "    s3_object = s3_connection.Object(bucket, document)\n",
    "    s3_response = s3_object.get()\n",
    "\n",
    "    # open binary stream using an in-memory bytes buffer\n",
    "    stream = io.BytesIO(s3_response['Body'].read())\n",
    "\n",
    "    # load stream into image\n",
    "    image = Image.open(stream)\n",
    "\n",
    "    # Display the image\n",
    "    display(image)\n",
    "\n",
    "    # process using Amazon S3 object\n",
    "    response = client.detect_document_text(Document={'S3Object': {'Bucket': bucket, 'Name': document}})\n",
    "\n",
    "    # Get the text blocks\n",
    "    blocks = response['Blocks']\n",
    "\n",
    "    # List to store image lines in document\n",
    "    line_list = []\n",
    "\n",
    "    # Create image showing bounding box/polygon around the detected lines/text\n",
    "    for block in blocks:\n",
    "        if block[\"BlockType\"] == \"LINE\":\n",
    "            line_list.append(block[\"Text\"])\n",
    "\n",
    "    return line_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "682344b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "lines = process_text_detection(bucket, document, access_key, secret_key, region_name)\n",
    "print(\"Text found: \" + str(lines))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a891bc8",
   "metadata": {},
   "source": [
    "You can now send the lines you extracted from the image to Amazon Comprehend and use the DetectEntities API to find all entities within those lines. You'll need a function that iterates through the list of lines returned by the \"process_text_detection\" function you wrote earlier and calls the DetectEntities operation on every line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "395a29f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def entity_detection(lines, access_code, secret_code, region):\n",
    "    \n",
    "    # Create a list to hold the entities found for every line\n",
    "    response_entities = []\n",
    "    \n",
    "    \n",
    "    # Connect to Amazon Comprehend\n",
    "    comprehend = boto3.client(service_name='comprehend', region_name=region, aws_access_key_id=access_code, \n",
    "    aws_secret_access_key=secret_code)\n",
    "\n",
    "    \n",
    "    # Iterate through the lines in the list of lines\n",
    "    for line in lines:\n",
    "\n",
    "        # construct a list to hold all found entities for a single line\n",
    "        entities_list = []\n",
    "\n",
    "        # Call the DetectEntities operation and pass it a line from lines\n",
    "        found_entities = comprehend.detect_entities(Text=line, LanguageCode='en')\n",
    "        for response_data, values in found_entities.items():\n",
    "            for item in values:\n",
    "                if \"Text\" in item:\n",
    "                    print(\"Entities found:\")\n",
    "                    for text, val in item.items():\n",
    "                        if text == \"Text\":\n",
    "                            # Append the found entities to the list of entities\n",
    "                            entities_list.append(val)\n",
    "                            print(val)\n",
    "        # Add all found entities for this line to the list of all entities found\n",
    "        response_entities.append(entities_list)\n",
    "\n",
    "    return response_entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3018960c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Calling DetectEntities:')\n",
    "print(\"------\")\n",
    "response_ents = entity_detection(lines, access_key, secret_key, region_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ceca38c9",
   "metadata": {},
   "source": [
    "Now that you have a list of the lines extracted by Amazon Textract and the entities found in those lines, you can create a dataframe that lets you see both. In the code below, a Pandas dataframe is constructed, displaying the lines found in the input image and their associated entities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7cc4fc64",
   "metadata": {},
   "outputs": [],
   "source": [
    "entities_dict = {\"Lines\":lines, \"Entities\":response_ents}\n",
    "df = pd.DataFrame(entities_dict, columns=[\"Lines\",\"Entities\"])\n",
    "print(df)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
